Tags: data engineering


  1. This article introduces Streamlit, a Python library for building data dashboards that lets Python programmers create graphical front-ends without writing CSS, HTML, or JavaScript. The author, a seasoned data engineer, explains how Streamlit and similar tools produce attractive dashboards, marking a shift away from traditional tools such as Tableau or QuickSight. The piece is the first in a series, with later articles planned on Gradio and Taipy. The author aims to replicate the same layout and functionality in each dashboard, using the same data throughout.
  2. This article explains the challenges of data integration in modern systems and how Apache Kafka addresses these issues by providing a decoupled, scalable, and maintainable architecture through its publish-subscribe model. The article covers Kafka’s architecture, core concepts, and benefits for real-time data streaming and event-driven systems.
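
The decoupling that the publish-subscribe model provides can be illustrated with a toy in-memory sketch. It is not Kafka's API and is not taken from the article: `ToyBroker`, its methods, and the topic and group names are hypothetical, and partitions, replication, and persistence are left out. It only shows producers appending to a topic log while each consumer group tracks its own read offset.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> list of messages
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this group and advance only its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
broker.publish("orders", {"order_id": 1})
broker.publish("orders", {"order_id": 2})

# Two independent consumer groups read the same log at their own pace.
billing_batch = broker.poll("billing", "orders")
analytics_first = broker.poll("analytics", "orders", max_messages=1)
analytics_second = broker.poll("analytics", "orders")
```

The producer never references a consumer, and each group's progress is independent of the others, which is the decoupling the summary describes.
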
  3. These one-liners provide quick and effective ways to assess the quality and consistency of the data within a Pandas DataFrame.

    | Code Snippet | Explanation |
    | --- | --- |
    | `df.isnull().sum()` | Counts the number of missing values per column. |
    | `df.duplicated().sum()` | Counts the number of duplicate rows in the DataFrame. |
    | `df.describe()` | Provides basic descriptive statistics of numerical columns. |
    | `df.info()` | Displays a concise summary of the DataFrame including data types and presence of null values. |
    | `df.nunique()` | Counts the number of unique values per column. |
    | `df.apply(lambda x: x.nunique() / x.count() * 100)` | Computes the percentage of unique values for each column. |
    | `df.isin([value]).sum()` | Counts the occurrences of a specific value in each column. |
    | `df.applymap(lambda x: isinstance(x, type_to_check)).sum()` | Counts the values of a specific type (e.g., `int`, `str`) per column. In pandas 2.1 and later, `applymap` is deprecated in favor of `df.map`. |
    | `df.dtypes` | Lists the data type for each column in the DataFrame. |
    | `df.sample(n)` | Returns a random sample of n rows from the DataFrame. |
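
As a rough illustration, several of the one-liners above can be run against a small DataFrame. The sample data and variable names below are invented for this sketch. The type check is written with `Series.map` per column rather than `applymap`, an assumption made so the snippet behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "city": ["Oslo", "Lima", "Lima", None],
        "score": [3.5, 4.0, 4.0, None],
    }
)

missing_per_column = df.isnull().sum()     # missing values per column
duplicate_rows = df.duplicated().sum()     # fully duplicated rows
unique_per_column = df.nunique()           # distinct non-null values per column

# Share of distinct values among the non-null values, in percent.
unique_pct = df.apply(lambda col: col.nunique() / col.count() * 100)

# Occurrences of one specific value in each column.
lima_per_column = df.isin(["Lima"]).sum()

# Count of string values per column (swapped in for the applymap one-liner).
str_per_column = df.apply(lambda col: col.map(lambda v: isinstance(v, str))).sum()

print(missing_per_column)
print(unique_pct.round(2))
```

Each result except `duplicate_rows` is a Series indexed by column name, so a single column's figure can be read with, for example, `missing_per_column["city"]`.
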
  4. An article on building an AI agent to interact with Apache Airflow using PydanticAI and Gemini 2.0, providing a structured and reliable method for managing DAGs through natural language queries.

    - Agent interacts with Apache Airflow via the Airflow REST API.
    - Agent can understand natural language queries about workflows, fetch real-time status updates, and return structured data.
    - Sample DAGs are implemented for demonstration purposes.
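
The idea of returning structured data from the Airflow REST API can be sketched as follows. This is not the article's code: it assumes the Airflow 2.x stable REST API, where DAG runs are listed under `/api/v1/dags/{dag_id}/dagRuns` in a `dag_runs` array, and it uses a standard-library dataclass in place of PydanticAI models. `DagRunStatus`, `dag_runs_url`, `latest_run`, and the sample payload are hypothetical, and no HTTP request is made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DagRunStatus:
    """Hypothetical structured result an agent tool could return."""
    dag_id: str
    dag_run_id: str
    state: str


def dag_runs_url(base_url: str, dag_id: str) -> str:
    """Build the DAG-runs endpoint URL (assumes the Airflow 2.x stable REST API)."""
    return f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"


def latest_run(payload: dict) -> Optional[DagRunStatus]:
    """Pick the newest run from a DAG-runs response body, or None if there are no runs."""
    runs = payload.get("dag_runs", [])
    if not runs:
        return None
    # Assumes ISO 8601 timestamps in one timezone, so string order matches time order.
    newest = max(runs, key=lambda run: run["logical_date"])
    return DagRunStatus(newest["dag_id"], newest["dag_run_id"], newest["state"])


# Hypothetical response body, shaped like the assumed API output.
sample_payload = {
    "dag_runs": [
        {"dag_id": "example_dag", "dag_run_id": "run_1",
         "state": "success", "logical_date": "2025-01-01T00:00:00+00:00"},
        {"dag_id": "example_dag", "dag_run_id": "run_2",
         "state": "running", "logical_date": "2025-01-02T00:00:00+00:00"},
    ],
    "total_entries": 2,
}

url = dag_runs_url("http://localhost:8080/", "example_dag")
status = latest_run(sample_payload)
```

Returning a typed object rather than raw JSON gives the language model a fixed schema to fill in, which is the kind of structure the summary attributes to the agent.
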
  5. Breser stands for Business Rules & Expression Syntax for Easy Retrieval. It is a powerful and flexible query language designed for efficient log processing and structured data filtering.
  6. A seven-week structured self-paced study guide for learning Apache Iceberg and its ecosystem, created after the author realized its increasing relevance in the data industry.
    2024-12-15, by klotz
  7. Apache Iceberg is emerging as a cornerstone for data lakes and lakehouses in the modern data stack, drawing parallels to the rise of Hadoop a decade ago. This article explores these similarities, highlighting both the opportunities and challenges that Iceberg presents for data engineering.
  8. A detailed exploration of Amazon S3 Tables, a new solution for scalable storage and management of tabular data leveraging Apache Iceberg, including features, setup, security, and benefits over traditional storage methods.
  9. An article detailing how to build a flexible, explainable, and algorithm-agnostic ML pipeline with MLflow, focusing on preprocessing, model training, and SHAP-based explanations.
  10. The article discusses the rise of Apache Iceberg as the dominant open table format, backed by major endorsements, and outlines key developments expected for 2025 such as Role-Based Access Control (RBAC) catalogs, Change Data Capture (CDC) capabilities, and materialized views.

SemanticScuttle - klotz.me: tagged with "data engineering"
